Belief propagation processor

ABSTRACT

A processor includes a first memory module for storing a first set of storage values each representing a respective input, and a second memory module for storing a second set of storage values in analog form. An analog module is coupled to the first and the second memory modules. The analog module is configured to, in each operation cycle of at least one iteration, update at least some of the second set of storage values based on the first and the second sets of storage values. An output module is for generating a set of outputs from at least some of the second set of storage values.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a divisional of U.S. application Ser. No. 13/079,204, filed Apr. 4, 2011, which is a continuation of PCT Application No. PCT/US10/25956, filed Mar. 2, 2010, which claims the benefit of U.S. Provisional Application No. 61/156,792, filed Mar. 2, 2009, and U.S. Provisional Application No. 61/293,999, filed Jan. 11, 2010. These applications are incorporated herein by reference.

This application is related to, but does not claim the benefit of the filing date of, U.S. Provisional Patent Application Ser. No. 61/156,794, titled “Circuits for Soft Logical Functions,” filed Mar. 2, 2009, and U.S. Provisional Patent Application Ser. No. 61/156,721, titled “Signal Mapping”, filed Mar. 2, 2009, and U.S. Provisional Patent Application Ser. No. 61/156,735, titled “Circuits for Soft Logical Functions,” filed Mar. 2, 2009. This application is also related to, but does not claim the benefit of the filing date of, U.S. application Ser. No. 12/537,060, titled “Storage Devices with Soft Processing,” filed Aug. 6, 2009. The contents of the above applications are incorporated herein by reference.

BACKGROUND

This document relates to an analog belief propagation processor.

“Belief Propagation” (BP) is an efficient approach to solving statistical inference problems. The approach exploits the underlying structure of a network of stochastic elements, its constraints, and Bayesian laws of probability to find the best set of valid outputs that satisfy the constraints and the network structure requirements.

Belief Propagation includes a class of techniques for performing statistical inference using a system model that is in the form of a graph. The term “graph” here refers to the mathematical definition of a graph, which represents the connectedness of a set of abstract objects. The objects are often referred to as “nodes” and the connections between objects are often referred to as “edges.” One common type of graph used in such models is referred to as a “factor graph.” In a factor graph (a particular style of factor graph called a “Forney factor graph”) the nodes represent statistical relationships between values, which are represented as edges. Other types of graphs, such as Bayesian networks and Markov random fields, are also commonly used for statistical inference.

Examples of Belief Propagation approaches operate by passing messages between nodes in the graph, where each message represents a summary of the information known by that node through its connections to other nodes. Such approaches are known by various names, including belief propagation, probability propagation, message passing, and summary-product algorithms, among others. Particular forms of these approaches include sum-product, max-product, and min-sum.

A large variety of approaches to coding, signal processing, and artificial intelligence may be viewed as instances of the summary-product approach (or belief/probability propagation approach), which operates by message passing in a graphical model. Specific instances of such approaches include Kalman filtering and smoothing, the forward-backward algorithm for hidden Markov models, probability propagation in Bayesian networks, and decoding algorithms for error correcting codes such as the Viterbi algorithm, the BCJR algorithm, and the iterative decoding of turbo codes, low-density parity check codes, and similar codes.

Graphs on which belief propagation may operate include two types: graphs with loops (cyclic graphs) and graphs with no loops (acyclic graphs). Graphs with no loops are also known as “trees.” Belief propagation procedures differ fundamentally between these two types of graphs. For a tree, a belief propagation approach can proceed in a well-defined order with a well-defined number of steps to compute the result, and, assuming ideal computation, this result is always known to be correct. For a graph with loops, on the other hand, belief propagation approaches are generally iterative, meaning the same set of computations must be repeated successively until a result is reached. In this case, the computation typically converges to a useful result, but does not always do so. In some cases, the computation may not converge to a single result, or if it does, the result in some cases is inaccurate. For a cyclic graph, the performance of belief propagation can depend on the order in which the computations are performed, which is known as the message passing “schedule.”

In one particular application mentioned above, Belief Propagation has been adopted as an efficient method of implementing decoders for various forward error correcting codes. In this case BP uses the structure of the code and its constraints to infer the correct valid codeword from an input codeword that contains noise, for instance, with each element (e.g., bit) of the input codeword being represented as a distribution rather than a discrete value. In some implementations of Belief Propagation for forward error correction, a Digital Signal Processor is used to perform the various arithmetic computations required by the algorithm, with all the statistical data being processed in digital format.

Observing that “soft” (probabilistic) data is continuous in nature, i.e., represented by real values in a finite interval, it is possible to implement a belief propagation algorithm using analog electrical circuits. Since only one signal is associated with each unit of statistical data, rather than multiple signals for the different digits (e.g., binary digits, bits) of a digital signal representing the same data, the savings in hardware and power dissipation can be very significant.

Several architectures have been proposed that utilize analog circuits to perform efficient decoding of various codes, including convolutional codes, Low Density Parity Check (LDPC) codes, or linear block codes. These include analog implementations that use a so-called full flat architecture, where each input data symbol is associated with a dedicated computing element.

SUMMARY

In one aspect, in general, an analog processor has a first memory module and a second memory module. The first memory module is for storing a first set of storage values in respective storage elements each representing a respective input to the processor. The second memory module is for storing a second set of storage values in analog form in respective storage elements. The second set of storage values includes intermediate values determined during operation of the processor. The analog processor also includes an analog computation module coupled to the first and the second memory modules. This processor is configurable such that in each of a set of operation cycles the analog module determines values for at least some of the second set of storage values based on at least some of the first and the second sets of storage values. An output module is used for generating a set of outputs from at least some of the second set of storage values.

Aspects may include one or more of the following features.

The first storage module is configured to store the first set of storage values in analog form.

The analog computation module is linked to the first and the second memory modules via analog signal paths. For example, the analog signal paths are each configured to carry a value on a conductor represented as at least one of a voltage and a current proportional to the value.

The analog module is configurable to determine values for a different subset of the second set of storage values in each of a plurality of operation cycles.

The processor includes input selection circuitry configurable to couple the analog computation module to outputs of selected memory elements of the first and the second memory modules.

The processor further includes, for each analog computation module, a plurality of signal busses, each bus providing an input value to the analog computation module and being switchably coupled to a plurality of the storage elements of the second memory module.

The storage elements are coupled to switchably provide a current representation of a storage value stored in the storage element such that the input value provided to the analog computation module is represented as a current that is substantially proportional to a sum of the current representations provided by the storage elements.

The processor further includes output section circuitry configurable to accept outputs of selected memory elements of the first and the second memory modules, and to determine outputs of the analog processor.

The processor includes multiple analog computation modules being concurrently operable to determine values for different subsets of the second set of storage values in each operation cycle.

The second memory module includes a plurality of sections, each associated with a corresponding different one of the analog computation modules for storing values determined by the associated computation module.

The second memory module is configured such that in a single operation cycle, each storage element can provide a storage value to one or more of the analog computation modules and can accept a determined value for storage in the storage element for providing in a subsequent operation cycle.

Each storage element is associated with two storage locations such that in any one cycle, one storage location is used for accepting a determined value and one storage location is used for providing a value.

The second memory module includes multiple memory sections. Groups of the sections form banks, wherein for each of the analog computation modules each of a set of inputs to the module is associated with a different bank of the memory sections.

The processor is configured to implement a belief propagation computation.

The processor is configured to implement a factor graph computation.

The processor is configured to implement a decoder for a low density parity check (LDPC) code.

The processor further includes a controller configured to control operation of the processor to perform a set of iterations of computation, each iteration comprising a set of computation cycles.

The set of computation cycles is substantially the same in each iteration, each cycle being associated with a configuration of the first and the second storage modules to provide inputs and outputs to one or more analog computation modules.

The processor is configured and/or configurable to implement a decoder for a parity check code, and each cycle is associated with one or more parity check constraints, wherein the cycles of each iteration are together associated with all the parity check constraints of the code.

The analog computation module implements a network of analog processing elements.

The analog processing elements include elements that represent soft logical operations. For example, the soft logical operations include soft XOR operations.

The network of elements is acyclic.

The network of elements includes at least one cycle of elements, the analog computation module being configured to implement a relaxation computation.

In another aspect, in general, a decoder includes a first memory for storing code data having a length in bits, and a second memory for storing intermediate data in analog form. The decoder includes an analog decoder core coupled to the first memory and to the second memory. The decoder core has an input length less than the length of the code data and an output length less than a number of constraints represented in the code data. The decoder further includes a controller for, in each of a set of cycles, coupling the inputs of the decoder core to selected values from the first and the second memories, and coupling outputs of the decoder core for storage in the second memory. An output section of the decoder is coupled to the second memory for providing decoded data based on values stored in the second memory.

In another aspect, in general, a method is used for forming a data representation of an analog processor. The method includes forming: a data representation of a first memory module for storing a first set of storage values in respective storage elements each representing a respective input to the processor; a data representation of a second memory module for storing a second set of storage values in analog form in respective storage elements, the second set of storage values including intermediate values determined during operation of the processor; a data representation of an analog computation module coupled to the first and the second memory modules, the processor being configurable such that in each of a set of operation cycles the analog module determines values for at least some of the second set of storage values based on at least some of the first and the second sets of storage values; and a data representation of an output module for generating a set of outputs from at least some of the second set of storage values.

In some examples, forming the data representations includes forming Verilog representations of the processor.

The method can further include fabricating an integrated circuit implementation of the analog processor according to the formed data representation.

In some examples, the method further includes accepting a specification of a parity check code and forming the data representations to represent an implementation of a decoder for the code.

In another aspect, in general, software stored on a computer readable medium includes instructions for and/or data imparting functionality when employed in a computer component of an apparatus for forming an integrated circuit implementation of any of the analog processors described above.

In another aspect, in general, a decoding method includes, in each of a series of cycles of a decoding operation, applying a portion of code data and a portion of intermediate value data to an analog decoder core, and storing an output of the decoder core in an analog storage for the intermediate data. Data, including intermediate value data from the analog storage, are combined to form decoded data representing an error correction of the code data.

In some examples, each of the series of cycles is associated with a corresponding subset of less than all of a plurality of parity-check constraints of the code. The intermediate value data may include values each associated with a different one of the parity check constraints of the code.

In another aspect, in general, a processor includes a first memory module for storing a first set of storage values each representing a respective input, and a second memory module for storing a second set of storage values in analog form. An analog module is coupled to the first and the second memory modules. The analog module is configured to, in each operation cycle of at least one iteration, update at least some of the second set of storage values based on the first and the second sets of storage values. An output module is for generating a set of outputs from at least some of the second set of storage values.

The analog module may be configured for updating a different subset of the second set of storage values in each of at least two operation cycles of an iteration.

The analog module may include a set of distributed components each configured to update a different subset of the second set of storage values using a different subset of the first set of storage values and the second set of storage values.

In another aspect, in general, a decoder includes a first memory for storing code data having a length in bits, and a second memory for storing intermediate data in analog form. An analog decoder core is coupled to the first memory and to the second memory, the decoder core having an input length less than the length of the code data and an output length less than a number of constraints represented in the code data. A controller in the decoder is for, in each of a plurality of cycles, coupling the inputs of the decoder core to selected values from the first and the second memories, and coupling outputs of the decoder core for storage in the second memory. An output section is coupled to the second memory for providing decoded data based on values stored in the second memory.

In another aspect, in general, a decoding method includes, in each of a number of cycles of a decoding operation, applying a portion of code data and a portion of intermediate value data to an analog decoder core, and storing an output of the decoder core in an analog storage for the intermediate data. Data, including intermediate value data from the analog storage, is then combined to form decoded data representing an error correction of the code data.

Advantages of one or more aspects may include the following:

Use of analog computations and/or analog storage of intermediate values provides lower power and/or smaller circuit area implementations as compared to digital implementations, for instance in applications of iterative decoding of error correcting codes.

Iterative use of one or more analog computation cores provides lower power and/or smaller circuit area as compared to fully parallel relaxation implementations of similar decoding algorithms. In some examples, a partial relaxation implementation, in which parts of a computation are implemented in relaxation form in each of a succession of cycles, may also provide similar advantages over a fully parallel relaxation implementation.

Approaches are applicable to decoding of block codes without requiring that the size and/or power requirements of an implementation grow substantially with the length of the code.

Other features and advantages of the invention are apparent from the following description, and from the claims.

DESCRIPTION OF DRAWINGS

FIG. 1 is an example factor graph for a length 8 LDPC code;

FIG. 2A is a diagram that illustrates transformation of a variable node with bidirectional links to a set of variable nodes with directed links, and FIG. 2B is a diagram that illustrates a similar transformation for a constraint node;

FIG. 3 is a portion of the graph shown in FIG. 1;

FIG. 4 is a portion of a directed graph corresponding to the portion of the bidirectional graph shown in FIG. 3;

FIG. 5 is a diagram illustrating a module implementation corresponding to the portion of the graph shown in FIG. 4;

FIG. 6 is a diagram illustrating output calculation;

FIG. 7A is a diagram that shows a relationship between inputs and outputs of a module, and FIG. 7B illustrates the corresponding code matrix;

FIG. 8 is a diagram of an implementation of a decoder for a length 8 LDPC code using a shared module;

FIG. 9 is a table that specifies inputs and outputs for the shared module shown in FIG. 8;

FIG. 10 is a block diagram of a decoder with two shared modules;

FIG. 11 is a tabular representation of a parity matrix for a (1056, 352) LDPC code;

FIG. 12 is a diagram of a shared module for use with the code shown in FIG. 11;

FIG. 13 is a block diagram of a decoder for a (1056, 352) LDPC code with eight shared modules (of which two are illustrated);

FIG. 14 is a circuit implementation of a variable node;

FIG. 15A is a circuit implementation of a constraint node;

FIG. 15B is an alternative implementation of a constraint node;

FIG. 16 is a diagram that illustrates a distributed bus implementation of a variable node;

FIG. 17 is a diagram of an alternative shared module;

FIG. 18 is a block diagram of a decoder that uses distributed bus implementations of variable nodes.

DESCRIPTION

Referring to FIG. 1, in one example of an analog-based implementation of a belief propagation processor, a decoder for a Low Density Parity Check (LDPC) code is based on a factor graph 100 in which one variable node 110 is associated with each different input bit (b_(j)), and one check (constraint) node 120 is associated with each constraint. In FIG. 1, an example with eight input bits with four checks (constraints) on the input bits is shown. The code can be represented in matrix form in which each column is associated with a different input bit, and each row is associated with a different check or constraint. An (i, j) entry is 1 if the j^(th) input is used in the i^(th) constraint and 0 otherwise. In the LDPC example, the constraint is that the XOR of the inputs for a constraint is 0. This example length 8 LDPC code can be represented according to the following check matrix (note that the rows are dependent modulo 2 in this illustrative example, which is not necessarily true in general):

$\begin{bmatrix}0 & 1 & 0 & 1 & 1 & 0 & 0 & 1 \\1 & 1 & 1 & 0 & 0 & 1 & 0 & 0 \\0 & 0 & 1 & 0 & 0 & 1 & 1 & 1 \\1 & 0 & 0 & 1 & 1 & 0 & 1 & 0\end{bmatrix}\quad$
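As a minimal sketch of how the check matrix is used, the following Python fragment encodes the matrix above and evaluates the XOR constraints for a candidate bit vector. The names `H`, `syndrome`, and `is_codeword` are illustrative choices and do not appear in the original text.

```python
# Check matrix of the length 8 example: rows are constraints, columns are input bits.
H = [
    [0, 1, 0, 1, 1, 0, 0, 1],
    [1, 1, 1, 0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0, 1, 1, 1],
    [1, 0, 0, 1, 1, 0, 1, 0],
]

def syndrome(H, bits):
    """XOR of the bits used in each constraint; 0 means the constraint holds."""
    return [sum(h * b for h, b in zip(row, bits)) % 2 for row in H]

def is_codeword(H, bits):
    """True when every parity check constraint is satisfied."""
    return not any(syndrome(H, bits))

# Column sums modulo 2: an all-zero result means the four rows add to zero,
# which is the row dependence noted in the text.
row_sum_mod2 = [sum(col) % 2 for col in zip(*H)]
```

For instance, the vector `[1, 1, 0, 1, 0, 0, 0, 0]` satisfies all four constraints, while flipping its first bit violates the two constraints (rows 1 and 3) that use bit 0.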

In FIG. 1, each edge is bidirectional. Referring to FIGS. 2A-B, an equivalent directed (unidirectional) graph can be formed by replacing each n-edge node with n separate nodes, each of the n nodes having n−1 inputs and one output, and forming unidirectional edges between the nodes to achieve the connectivity of the original graph. Referring to FIG. 2A, for instance, each 3-edge variable node 110 can be replaced with three 2-input/1-output variable nodes 210, 212. Referring to FIG. 2B, each 4-edge check node 120 can be replaced with four 3-input/1-output check nodes 220.

One approach to analog implementation of a decoder corresponding to the factor graph shown in FIG. 1 is to implement a circuit element for each node of the equivalent unidirectional graph. Referring to FIG. 3, a portion of the graph shown in FIG. 1 is illustrated showing check node 0 (120), the bidirectional edges and variable nodes 1, 3, 4 and 7 (110) linked to that check node, as well as the other check nodes 1, 2 and 3 (120) linked to those variable nodes. Referring to FIG. 4, a portion of the corresponding directed graph is shown in which check node 0 (120) is expanded as four 3-input/1-output check nodes 220, for instance, labeled “0/1” to indicate that this is part of the expansion of check node 0 with the output link coupled to variable node 1. Similarly variable node 1 (110) is shown in its expansion into three 2-input/1-output nodes 210, 212, for instance, labeled “1/0” to indicate that this is part of the expansion of variable node 1 with the output link coupled to check node 0, or labeled “1/out” to indicate that the output link provides an output of the factor graph.

In the example, which is partially illustrated in FIG. 4, a full implementation has four circuit elements for each check node (i.e., 16 total expanded unidirectional check nodes 220), and three circuit elements for each variable node (i.e., 24 total expanded unidirectional variable nodes 210, 212). Out of the three circuit elements for a variable node, two (i.e., 16 total expanded variable nodes 210 for all variable nodes) are used for message passing in an iterative stage of decoding operation, and one (i.e., 8 total expanded variable nodes 212 for all variable nodes) is used for generating the decoder output (i.e., the “belief”) in an output stage of decoding operation, as will be described further below.
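The node counts above can be reproduced from the check matrix. The following is a small Python sketch that enumerates the expanded unidirectional nodes using the “i/j” labeling described for FIG. 4; the function name `expand` and the list-based representation are assumptions made for illustration only.

```python
H = [
    [0, 1, 0, 1, 1, 0, 0, 1],
    [1, 1, 1, 0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0, 1, 1, 1],
    [1, 0, 0, 1, 1, 0, 1, 0],
]

def expand(H):
    """Enumerate the labels of the expanded (unidirectional) nodes.

    check_nodes:    "i/j" = part of check node i, output toward variable node j
    variable_nodes: "j/i" = part of variable node j, output toward check node i
    output_nodes:   "j/out" = part of variable node j, output is a decoder output
    """
    check_nodes, variable_nodes = [], []
    for i, row in enumerate(H):
        for j, h in enumerate(row):
            if h:
                check_nodes.append(f"{i}/{j}")
                variable_nodes.append(f"{j}/{i}")
    output_nodes = [f"{j}/out" for j in range(len(H[0]))]
    return check_nodes, variable_nodes, output_nodes

checks, variables, outputs = expand(H)
```

With the example matrix this yields 16 expanded check nodes, 16 message-passing variable nodes, and 8 output variable nodes, for 24 expanded variable nodes in total.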

In operation, input signals y_(i) are used to determine corresponding analog representations of input messages, which may be determined in a signal mapping circuit. In some examples, the input messages form representations of the probabilities corresponding to bits b_(i), but the reader should recognize that the discussion below with respect to computations involving representations of bit probabilities is illustrative of a particular form of input and internal messages that are stored or passed during computation. These messages are provided to the inputs of the variable nodes 210, for example, as outputs of analog input registers 260. As discussed further below, in some embodiments the representations of the bit probabilities are provided as analog signals from the input registers 260 encoding a (prior) log likelihood ratio (LLR) which is typically of the form

${\log ( \frac{\Pr ( {b_{i} =  0 \middle| y_{i} } )}{\Pr ( {b_{i} =  1 \middle| y_{i} } )} )},$

In the case of equal prior bit probabilities P(b_(i)=0)=P(b_(i)=1) isequal to

${\log ( \frac{P( { y_{i} \middle| b_{i}  = 0} )}{P( { y_{i} \middle| b_{i}  = 1} )} )}.$

In some examples, these bit probabilities are encoded as voltage or current in single-ended or differential form (e.g., using a pair of conducting paths for each unidirectional signal).

The approach partially illustrated in FIG. 4 is one of a number of approaches to implementation of a decoder corresponding to the graph shown in FIG. 1 that involve introducing an analog memory element 230 to break some or all cycles in the directed graph. In the approach shown in FIG. 4, the memory elements are introduced at the outputs of the check nodes. Other versions have such memory elements introduced at the output of the variable nodes instead of or in addition to the memory elements at the outputs of the check nodes. Note that in yet other embodiments, some or all cycles remain without memory elements, and operation is at least partially based on a “relaxation” form of computation as signals propagate through the cycles. In some embodiments, a combination of relaxation and memory based computation is used.

As illustrated in the example partially illustrated in FIG. 4, memory elements 230 in this embodiment store values in analog form, and are introduced at each output of the check nodes 220; that is, 16 memory elements are introduced. For notational simplicity, these locations are indexed as (i, j) and labeled “Ci,j”, for the output from check node i that is linked to variable node j. Note that each location corresponds to one of the non-zero entries in the check matrix of the code. The (i, j) memory location corresponds to the row i, column j, non-zero entry of the check matrix of the code.

In a number of approaches that make use of analog memory elements, the memory is introduced in the circuit implementation of the graph such that no cycles remain in the directed graph, that is, all cycles in the directed graph are broken. The circuit implementation is then operated in a series of clocked cycles, such that at each cycle analog values read from some or all of the analog memory elements are propagated through analog circuit elements to inputs of some or all of the memory elements where they are stored at the end of the clock cycle. As discussed in detail below, such a clocked (“discrete time”) implementation can be used to decode with a result that is similar to that which would result from a relaxation (“continuous time”) implementation.

Referring to FIG. 5, another partial illustration of the example shown in FIG. 4 includes outputs of the four expanded check nodes 220 associated with the original check node 0 (120). A circuit block 390 forms an analog computation module that includes implementations of the expanded variable nodes 1/0, 3/0, 4/0, and 7/0 (210) which have outputs to the four expanded check nodes 220. Note that check node 0 corresponds to row 0 of the matrix representation of the code, which is reproduced in FIG. 7B. Note that the outputs of the circuit block 390 correspond to the memory locations in row 0 of the matrix representation, as illustrated in FIG. 7A. The inputs of the circuit block 390 correspond to the non-zero entries in each column of the matrix representation that has a non-zero entry in row 0, omitting those entries in row 0. In this illustration, the inputs correspond to the non-zero entries in columns 1, 3, 4 and 7 in rows 1, 2 and 3. This results in four memory cell inputs, C1,1, C3,3, C3,4 and C2,7, in addition to the inputs from the input bit probabilities, B1, B3, B4, and B7.
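The rule for deriving the inputs and outputs of the circuit block from the check matrix can be sketched in Python as follows. The function name `module_io` is hypothetical; the labels “Ci,j” and “Bj” follow the notation used in the text.

```python
H = [
    [0, 1, 0, 1, 1, 0, 0, 1],
    [1, 1, 1, 0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0, 1, 1, 1],
    [1, 0, 0, 1, 1, 0, 1, 0],
]

def module_io(H, row):
    """Memory outputs, memory inputs, and bit inputs of the block for one check row."""
    outputs, memory_inputs, bit_inputs = [], [], []
    for j, h in enumerate(H[row]):
        if not h:
            continue  # column j does not take part in this constraint
        outputs.append(f"C{row},{j}")   # one output per non-zero entry in the row
        bit_inputs.append(f"B{j}")      # input bit probability for column j
        # Non-zero entries in column j, omitting the entry in the selected row.
        for i in range(len(H)):
            if i != row and H[i][j]:
                memory_inputs.append(f"C{i},{j}")
    return outputs, memory_inputs, bit_inputs

outs, mem_ins, bit_ins = module_io(H, 0)
```

For row 0 this gives outputs C0,1, C0,3, C0,4, C0,7, memory inputs C1,1, C3,3, C3,4, C2,7, and bit inputs B1, B3, B4, B7, matching the description above. Calling the function for rows 1 to 3 gives the corresponding configurations for the remaining constraints.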

An example of a full clocked circuit implementation of a decoder for the length 8 LDPC code has a memory element 230 at the output of each unidirectional check node 220, and four copies of the circuit block 390, one corresponding to each row of the code matrix. In the first stage of decoding operation, each unidirectional variable node 210 (i.e., a total of 16 circuit elements) takes its input from an output of a memory element 230, and one of the input bit probabilities 260. (Note that in general for other size codes, the variable nodes are associated with more than two check nodes, and therefore variable nodes would take as input values from multiple memory elements.) The memory cells 230 as a whole form a memory that is configured so that effectively all the values are updated at once at the end of each clock cycle. One implementation of such a memory uses a “double buffering” approach in which two banks of memory are used, and in each clock period, one bank is read from and the other bank is written to, with the banks switching role between each clock period.
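A numerical sketch of one such clock cycle with double buffering is given below in Python. It is a software model only, not the analog circuit. It assumes LLR-valued messages, a variable node that sums its inputs, and the standard sum-product “tanh rule” as the soft XOR; the text names soft XOR but does not give this formula here, so that choice, the zero initialisation of the memory, and the names `soft_xor`, `clock_cycle`, and `run` are all assumptions.

```python
import math

H = [
    [0, 1, 0, 1, 1, 0, 0, 1],
    [1, 1, 1, 0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0, 1, 1, 1],
    [1, 0, 0, 1, 1, 0, 1, 0],
]

def soft_xor(llrs):
    """Assumed soft XOR in the LLR domain: 2*atanh(prod(tanh(x/2)))."""
    prod = 1.0
    for x in llrs:
        prod *= math.tanh(x / 2.0)
    prod = max(min(prod, 1.0 - 1e-12), -(1.0 - 1e-12))  # keep atanh finite
    return 2.0 * math.atanh(prod)

def clock_cycle(H, B, read_bank):
    """Compute a new value for every memory cell C[i, j] from the bank being read."""
    write_bank = {}
    for i, row in enumerate(H):
        cols = [j for j, h in enumerate(row) if h]
        # Variable node j/i: input register plus messages from the other checks on column j.
        v = {j: B[j] + sum(read_bank[(k, j)] for k in range(len(H))
                           if k != i and H[k][j]) for j in cols}
        # Check node i/j: soft XOR of the other variable-node outputs of this row.
        for j in cols:
            write_bank[(i, j)] = soft_xor([v[k] for k in cols if k != j])
    return write_bank

def run(H, B, num_cycles):
    """Alternate the read and write roles of two banks on every clock cycle."""
    zeros = {(i, j): 0.0 for i, row in enumerate(H) for j, h in enumerate(row) if h}
    banks = [zeros, {}]  # bank 0 starts as the read bank (assumed initial state)
    read = 0
    for _ in range(num_cycles):
        banks[1 - read] = clock_cycle(H, B, banks[read])
        read = 1 - read
    return banks[read]
```

Because each cycle reads only from one bank and writes only to the other, every cell sees the values from the previous cycle, which models the statement that all values are effectively updated at once at the end of the clock cycle.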

In some examples, the decoder may perform memory updates in successive clock cycles, each clock cycle corresponding to a full update of all memory cells of the memory 250. The number of clock cycles to be performed in the first stage of decoding operation may be pre-determined, for example, based on design preference, or depend upon the satisfaction of certain convergence conditions, for example, satisfaction of the code constraints (i.e., full error correction) or a condition based on a rate of change of output values between iterations.

Referring to FIG. 6, in some examples, once the iterations of memory updates are completed, the decoder proceeds to the output stage of decoding operation to generate decoder outputs representing bit estimates. Here, the decoder outputs are denoted as $\hat{b}_{j}$, each being an estimate of a corresponding input bit (b_(j)) based on the entire input signal. In some examples, as illustrated in FIG. 6, the variable node 212 outputs a message that includes a representation of the bit probability after decoding, for example, as an LLR, which can be considered to approximate

$\log\left(\frac{P(b_{i} = 0 \mid y_{\backslash i})}{P(b_{i} = 1 \mid y_{\backslash i})}\right)$

where $y_{\backslash i}$ denotes the observations not including y_(i). The output of variable node 212 is combined in a combination element 312 with the input bit probability representation from input register 260 to form the representation of the bit probability based on all the inputs and the constraints between the decoded bits. Recall that the output of input register 260 can be considered to represent

$\log\left(\frac{\Pr(b_{i} = 0 \mid y_{i})}{\Pr(b_{i} = 1 \mid y_{i})}\right)$

and therefore the combined probability output from combination element 312, which is computed as a sum, approximates

$\log\left(\frac{\Pr(b_{i} = 0 \mid y)}{\Pr(b_{i} = 1 \mid y)}\right)$

where y represents all the input values. Optionally the combined bit probability is passed through a hard decision, which in the case of binary outputs and logarithmic representations determines $\hat{b}_{j}$ to take on the value of either 0 or 1 based on a thresholding of the combined log likelihood ratio as either greater or less than zero. For example, one output element uses memory elements C0,1 and C1,1 and the input B1 to generate bit estimate $\hat{b}_{1}$. In some implementations, the set of eight output elements may be configured to operate in a parallel fashion to generate the full set of bit estimates $\hat{b}_{j}$ in a single clock cycle. Note that as illustrated in FIG. 6, elements 212 and 312 are drawn as separate. However, each effectively computes a sum of its inputs, and the two summations may be combined into a single circuit element 315.
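The output stage can be sketched numerically as follows, assuming LLR-valued messages as in the equations above. The Python names `combined_llrs` and `hard_decision` are illustrative, and the treatment of an LLR of exactly zero (mapped to 1 here) is an arbitrary choice that the text does not specify.

```python
H = [
    [0, 1, 0, 1, 1, 0, 0, 1],
    [1, 1, 1, 0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0, 1, 1, 1],
    [1, 0, 0, 1, 1, 0, 1, 0],
]

def combined_llrs(H, B, C):
    """Sum the input LLR B[j] with all stored check messages C[i, j] for column j.

    The inner sum plays the role of variable node 212 and the outer addition
    the role of combination element 312; together they model element 315.
    """
    result = []
    for j in range(len(B)):
        from_checks = sum(C[(i, j)] for i in range(len(H)) if H[i][j])
        result.append(B[j] + from_checks)
    return result

def hard_decision(llrs):
    """Threshold at zero: a positive LLR favours bit 0, otherwise decide 1."""
    return [0 if llr > 0 else 1 for llr in llrs]
```

As a usage example, with B1 = -0.5 and stored messages C0,1 = C1,1 = 0.9, the combined LLR for bit 1 is 1.3, so the estimate for that bit is 0 even though the channel input alone favoured 1.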

Referring to FIG. 8, in another example of a clocked circuitimplementation each of the nodes of the directed graph is not requiredto correspond to a different circuit element. That is, certain circuitelements form analog computation modules (“cores”) that are reusedmultiple times with different input and output connections (i.e.,shared) within each iteration. The functions performed by multiplemodules 390 in one clock cycle in the previous example are carried outin a series of clock cycles such that at in each of the series of clockcycles, only some of the memory elements 230 are updated, with all thememory elements being updated at the end of the series of clock cycles.Similarly, in the output stage of decoding operation, one or more sharedcircuit elements (e.g., element 315) may be reused in an output section395 for generating one or more bit estimates in each of a series ofclock cycles. In the discussion below, the entire series of clock cyclesthat updates all the memory elements in FIG. 3 is referred to as an“iteration.”

Continuing to refer to FIG. 8, a shared module 390 is coupled to input selection circuitry 370 and output circuitry 380, which together provide interfaces to the memory elements 230 in the memory 250. For example, the input circuitry 370 couples each input of a variable node 210 to the output of an appropriate memory cell 230 and to an appropriate input register 260, which collectively form an input memory module 265, and the output circuitry 380 passes the outputs of the check nodes 220 to the inputs of appropriate memory cells 230, which collectively form an intermediate memory module 250. In this example, the shared module 390 includes all the variable nodes 210 and check nodes 220 needed to compute all the outputs corresponding to one of the bidirectional check nodes 120 in the factor graph illustrated in FIG. 1. During each successive clock cycle of an iteration, the input circuitry 370 and the output circuitry 380 are effectively reconfigured to change the connection of the variable nodes 210 and check nodes 220 to the memory 250 and the input bits.

As an example of a multiple cycle iteration using the shared module 390 illustrated in FIG. 8, the table shown in FIG. 9 illustrates the configurations during the four clock cycles of an iteration. Note that the configuration indicated for cycle 0 corresponds to the configuration illustrated in FIG. 5.

In some examples, multiple shared modules 390 are implemented in a single integrated circuit. For example, the example shown in FIG. 8 may be modified to have two shared modules, thereby providing eight new values for memory cells 230 in each clock cycle, with the iteration to update all the memory cells taking a total of two cycles (i.e., four constraints per iteration divided by two constraints per cycle yielding two cycles per iteration). Similarly, in some examples, a shared module may update fewer cells, for example, updating only a single cell in each clock cycle (i.e., using a single check node 220 and three variable nodes 210).
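The relationship between the number of shared modules and the number of cycles per iteration can be illustrated with a short calculation. The function name is hypothetical, and rounding up when the constraints do not divide evenly among the modules is an assumption of this sketch; the text only gives evenly divisible cases.

```python
import math


def cycles_per_iteration(num_constraints, num_shared_modules):
    """Cycles needed when each shared module applies one constraint per cycle.

    Rounding up for uneven division is an assumption of this sketch.
    """
    return math.ceil(num_constraints / num_shared_modules)


# Length 8 example: four constraints, one or two shared modules.
single_module_cycles = cycles_per_iteration(4, 1)
two_module_cycles = cycles_per_iteration(4, 2)
```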

In the example illustrated above in FIG. 8, the updated values to the memory 250 are not passed through to the outputs of the memory until after the entire iteration is completed. In some examples, the updated values determined in one clock cycle may be presented at the output of the memory during subsequent clock cycles within the same iteration. In such examples, the order in which the outputs of the check nodes are computed (the “schedule”) may be significant. Examples of schedules include a sequential updating of the outputs associated with each of the check nodes 120 (see the factor graph in FIG. 1), and random updating in which different nodes are updated at each clock cycle.

Referring to FIG. 10, in some examples, multiple modules 390 are used (but not a sufficient number so that an iteration may be completed in a single cycle), and the input selection circuitry 370, output selection circuitry 380, and memory 250 are distributed among a set of local processing elements 490, and each local processing element 490 has one shared module 390. Each local processing element has a local output circuitry 480 and a local input selection circuitry 470. The memory is distributed such that the memory cells 230 in the memory 450 of a local processing element are those cells that are updated by the shared module 390 in the various clock cycles of an iteration. As illustrated, each row of memory cells is updated in one clock cycle. A control input controls the configuration of the input and output circuitry according to the cycle in the iteration being performed. Note that in general, a shared module 390 at one local processing element 490 requires outputs of memory cells 230 in a local memory 450 of its own local processing element and/or another (or more generally, one or more other) local processing element. The local input selection circuitry 470 selects the memory cells required by each of the local processing elements and passes those values on to a global selection unit 440, which then determines the proper subsets of the memory values to be passed on to each one of those local processing elements. In the output stage, the memory cells are coupled through the selection circuitry 470 to the output section 495 to determine the outputs. The configuration shown in FIG. 10 can also be understood as distributing the function of the input selection logic 370 shown in FIG. 8 among blocks 470 and 442, and the function of the output logic 380 among the blocks 480.

In some examples, the global selection unit 440 may include a set of selection units 442, each coupled to inputs of a respective local processing element to provide the corresponding subset of memory values to the shared module 390. For example, one selection unit 442 may receive 8 signals representing memory values provided by the two local input selection circuitries 470 to generate four output signals representing the memory values to be provided to the local processing element shown on the left of FIG. 10.

Referring again to FIG. 8, in some examples, the memory 250 as a whole is configured such that effectively all the values are updated exactly once in an iteration. In one implementation of such a memory using a “double buffering” approach, two banks of memory are used. In iteration k, the write circuitry always writes into memory bank #1, and the read circuitry always reads from memory bank #2. By the end of iteration k, memory bank #1 has achieved a full update. In the next iteration k+1, the write circuitry switches to write into memory bank #2, and the read circuitry reads from memory bank #1, which was just updated in the last iteration. In this case, the memory 250 would need a capacity twice the amount of the outputs from the local check nodes to keep two different copies for read and write operations respectively.
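The double buffering scheme can be modeled in software as two arrays whose read and write roles swap at the end of each iteration. The sketch below is only a behavioral illustration under that assumption; the class and method names are hypothetical and the analog storage is represented by ordinary floating point values.

```python
class DoubleBufferedMemory:
    """Behavioral model of a two-bank memory (hypothetical names)."""

    def __init__(self, size, initial=0.0):
        # Two banks of equal size: one is read while the other is written.
        self.banks = [[initial] * size, [initial] * size]
        self.read_bank = 0

    def read(self, index):
        # Reads always come from the bank completed in the previous iteration.
        return self.banks[self.read_bank][index]

    def write(self, index, value):
        # Writes always go to the other bank, so reads in the same
        # iteration never observe partially updated values.
        self.banks[1 - self.read_bank][index] = value

    def end_iteration(self):
        # Swap roles: the freshly written bank becomes the read bank.
        self.read_bank = 1 - self.read_bank


mem = DoubleBufferedMemory(size=2)
mem.write(0, 1.5)
value_during_iteration = mem.read(0)   # still the old value
mem.end_iteration()
value_next_iteration = mem.read(0)     # now the updated value
```

The cost of this arrangement is the doubled storage noted in the text, in exchange for a schedule-independent result within each iteration.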

The approaches described above in the context of a length 8 code are applicable to a larger example of a (1056,352) LDPC code, such as is used in IEEE 802.16 based communication. The check matrix of the code can be represented in tabular form by breaking the 0,1 matrix into 8 rows by 24 columns of 44 by 44 blocks, with each block being either all zero, or being a shifted diagonal with one non-zero entry in each row and in each column. This tabular representation of the code is shown in FIG. 11. The upper-left (0,0) block (shown as a “0”) in the tabular representation is a diagonal matrix. The (0,2) block, shown as an “8”, is an off-diagonal block $M=[m_{i,j}]$ such that $m_{i,j}=1$ if $j = i+8 \pmod{44}$ and 0 otherwise. The full factor graph is not illustrated, but can be derived from the matrix representation in the same manner as the example illustrated in FIG. 1.
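As an illustration of the block structure, a shifted diagonal block can be generated directly from its shift value. The function name below is hypothetical; the block size of 44 and the rule that entry (i, j) is 1 when j = i + shift (mod 44) follow the description above.

```python
def shifted_diagonal_block(shift, size=44):
    """Build a size-by-size 0/1 block with m[i][j] = 1 iff j == (i + shift) mod size."""
    return [
        [1 if j == (i + shift) % size else 0 for j in range(size)]
        for i in range(size)
    ]


block_0 = shifted_diagonal_block(0)   # the (0,0) block: a diagonal matrix
block_8 = shifted_diagonal_block(8)   # the (0,2) block shown as "8"

# Each row and each column of a shifted diagonal holds exactly one non-zero entry.
row_sums = [sum(row) for row in block_8]
col_sums = [sum(row[j] for row in block_8) for j in range(44)]
```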

Referring to FIG. 12, a module 590 is configured to include variable nodes 510 and constraint nodes 520 for the code shown in FIG. 11. Note that the nodes illustrated in FIG. 12 are unidirectional nodes in which links are either input or output links. Module 590 is analogous to module 390 for the length 8 code discussed above. Note that each row in the code matrix shown in FIG. 11 has ten non-zero entries in every row block except row block 6, in which each row has eleven non-zero entries. In order to implement constraints outside row block 6, the module 590 has ten (unidirectional) variable nodes 510 and ten (unidirectional) constraint nodes 520, and for rows in row block 6, eleven (unidirectional) variable nodes 510 and eleven (unidirectional) constraint nodes 520. Each variable node accepts inputs from memory cells corresponding to non-zero entries in a particular column of the code matrix. Therefore, variable nodes corresponding to columns in the range 0 through 15 have four inputs (three inputs for memory cells corresponding to entries in the code matrix and one input for the bit probability) and one output. Variable nodes for columns 16 through 23 have two or three inputs depending on the column and the block row. In some examples, the module 590 has the maximum number of variable nodes and inputs necessary, and is configurable during different cycles to accommodate the specific number of variable nodes and inputs needed, for instance, by ignoring certain inputs.

FIG. 13 illustrates one type of implementation of a decoder operable to perform the iterative stage of decoding operation for use with the (1056, 352) LDPC code shown in FIG. 11. In such an implementation, the decoder includes an analog input memory 660 that stores representations of the input bit probabilities (e.g., as voltages encoding log likelihood ratios) corresponding to the 1056 bits (i.e., 24 blocks of 44) of the LDPC code illustrated in FIG. 11. These input bits are then distributed by memory selection circuitry 672 to be processed in a set of local processing elements 690. Each local processing element 690 has a shared module 590 that includes the variable nodes and check nodes needed to compute all the outputs corresponding to one of the check nodes of the full factor graph. The structure of each processing element 690 is similar to each processing element 490 shown in FIG. 10 to implement the decoder for a length 8 code.

Each local processing element 690 also includes a local output circuitry 680, which directs the output of the local check nodes into appropriate cells 630 of a memory 650. In this example, the memory is distributed among the set of local processing elements 690 as a set of local memories 650, each of which includes memory cells 630 updated by the shared module 590 of its local processing element 690 (not other local processing elements) in the various clock cycles of an iteration. As described before, in general, each shared module 590 at one local processing element 690 requires outputs of memory cells in a local memory 650 of its own local processing element and/or one or more other local processing elements. These outputs are obtained by a set of local read circuitry 670 that retrieve values from the local memory 650 and send them to a global selection unit 640, which then determines the appropriate combinations of output values to be sent to the individual local processing elements 690 at various clock cycles. The global selection unit 640 includes a separate input selection unit 642 associated with each of the local processing elements, and provides as outputs the values stored in the memories 650 required as input to that unit on each iteration.

Implementations of the type illustrated in FIG. 13 can have different numbers of processing elements, and use different schedules of applying each of the 352 constraints in different cycles of a decoding iteration. Referring back to FIG. 11, in the matrix representation of the (1056, 352) LDPC code, out of the total 8 blocks of rows, each row in 7 of the row blocks (i.e., row blocks 0-5 and 7) contains 10 non-zero entries, and each row in one row block (row block 6) contains 11 non-zero entries. Therefore, of the 192 (i.e., 24*8) blocks of entries, only 81 (i.e., 7*10+11) are non-zero, and each of the non-zero blocks has exactly 44 non-zero entries, for a total of 3564 (81 times 44) entries.
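The counts in the preceding paragraph can be checked with a few lines of arithmetic. The variable names are illustrative only; the per-row-block counts (10 non-zero blocks in row blocks 0-5 and 7, 11 in row block 6) are taken from the description of FIG. 11 above.

```python
ROW_BLOCKS = 8
COLUMN_BLOCKS = 24
BLOCK_SIZE = 44

# Non-zero blocks in each row block: 11 for row block 6, 10 otherwise.
nonzero_blocks_per_row_block = [11 if rb == 6 else 10 for rb in range(ROW_BLOCKS)]

total_blocks = ROW_BLOCKS * COLUMN_BLOCKS                # blocks in the table
nonzero_blocks = sum(nonzero_blocks_per_row_block)       # non-zero blocks
nonzero_entries = nonzero_blocks * BLOCK_SIZE            # one entry per row of each block
code_length = COLUMN_BLOCKS * BLOCK_SIZE                 # bits per codeword
num_constraints = ROW_BLOCKS * BLOCK_SIZE                # rows of the check matrix
```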

The exemplary arrangement shown in FIG. 13 uses a set of 8 local processing elements 690, each configurable to perform the computation associated with one check node (i.e., one row) of a corresponding block of rows of the check matrix. In other words, each local processing element 690 is used repeatedly in 44 cycles to compute the outputs for the 44 constraints represented by the 44 rows of the code matrix in the corresponding block. For example, the 8 elements apply constraints 0, 44, 88, . . . , 308, respectively, on the first cycle, constraints 1, 45, 89, . . . , 309, respectively, on the second cycle, and finally constraints 43, 87, 131, . . . , 351, respectively, on the last cycle of an iteration.
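The schedule in this example maps each cycle to one constraint per local processing element. A minimal sketch, assuming elements are numbered 0 through 7 and cycles 0 through 43 (the function name is hypothetical):

```python
def constraints_for_cycle(cycle, num_elements=8, rows_per_block=44):
    """Return the constraint (row) index applied by each processing element in a cycle.

    Element e handles row block e, so in a given cycle it applies
    constraint e * rows_per_block + cycle.
    """
    return [element * rows_per_block + cycle for element in range(num_elements)]


first_cycle = constraints_for_cycle(0)
second_cycle = constraints_for_cycle(1)
last_cycle = constraints_for_cycle(43)
```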

Note that because of differences in each of the row blocks in the code matrix, the shared module 590 in each local processing element 690 may differ. Consider a shared module 590 for performing the computation associated with a row in the first (row block 0) block. The corresponding check node in the bidirectional graph has 10 edges linked to variable nodes. Each of the first 8 variable nodes has five edges, four to check nodes and one to a bit input; the 9^(th) variable node has four edges, three to check nodes and one to a bit input; and the 10^(th) variable node has three edges, two to check nodes and one to a bit input. The shared module 590 therefore has circuits for 10 (directional) check nodes, each with 9 inputs and one output. The 10 outputs of the check nodes update 10 locations in the local memory. The local processor has circuits for 10 (directional) variable nodes 510, each with four, three, or two inputs and one output. Each node 510 provides an input to 9 of the 10 (directional) check nodes 520. Of the inputs for each variable node, one input is for an input bit probability and the remainder are for values from the local memories.

Shared modules 590 in the other local processing elements have the same structure as that associated with row block 0, with differences including the shared module 590 for row block 6 having 11 check nodes and 11 variable nodes, and the shared module 590 for blocks 1 through 5 each having two variable nodes with two inputs and the other variable nodes having four inputs.

In operation, at each clock cycle, the variable nodes of the shared module 590 for row block 0 read 10 sets of inputs from the input memory 660, one set for each variable node, and the module updates 10 locations of the local memory 650. The values from memory 650 are passed through the blocks 670 of multiple of the local processing elements 690 and through the control unit 642 associated with the destination processing element. Over 44 clock cycles of an iteration, the shared module 590 provides updated values for all 440 (44 times 10) locations in the local memory.

As outlined above, in some embodiments, each one of the shared modules 590 may be implemented as a combination of 10 variable nodes and 10 check nodes (also referred to as a 10×10 shared module), except for the shared module 590 for row block 6, which is implemented as an 11×11 module.

A number of different circuit arrangements and signal encodings can be used within the approaches described above. For certain soft decoding applications, each variable node circuit can be formed using a soft Equals gate, and each check node circuit can be formed by a soft XOR gate. In the example of FIG. 13, each variable node takes the form of a 4-input (or 3-input or 2-input) soft Equals gate and each check node takes the form of a 9-input (or 10-input) soft XOR gate. Therefore, for each shared module 590, besides reading the 10 (or 11) input bits (one for each soft Equals gate) from the input memory 660, it also requires 10 (or 11) sets of values from the local memories 650 (one set for each soft Equals gate). Note that in this example these values come from the memory cells in the other local processing element(s), and not from the memory 650 in the same processing element.

One approach for providing the proper combinations of memory values needed as input to the shared modules 590 includes forming, in the global control unit 640, a set of 8 individual selection units 642, each of which selects or combines the outputs of the local processing elements 690 as needed for the input values for a corresponding shared module 590. In some examples, each one of the read circuitry 670 is selectively coupled to the set of 8 selection units, for example, using a set of 8 buses with each bus containing 10 (or 11) wires for sending a total of 10 (or 11) output values to an individual selection unit in one clock cycle. The selection unit 642 then chooses a set of 10×3 (or 11×3) output values for input to the shared module 590.

By arranging the decoder into local processing elements, in some embodiments, all of the XOR signals become local to the local processing elements in which they are formed. The inputs to the Equals gates become globally routed signals that come from multiple local processing elements. In some examples, the local processing elements 690 can be configured in a way such that each shared module 590 requires only output values from a pre-defined set of three other local processing elements. As a result, the coupling between each local processing element and the global control unit 640 can be reduced, for example, with read circuitry 670 now being coupled to only 3 (instead of 8) selection units. In some examples, the local processing elements 690 can be further arranged such that all of the even-numbered (i.e., 0, 2, 4, and 6) local processing elements communicate with each other but not with the odd-numbered (i.e., 1, 3, 5, and 7) local processing elements (except for the last eight block columns of the check matrix).

Note that, in some applications relating to soft decoding, the decoder described above is used for converting input “soft” bits based on individual measurements of each bit to soft bits each based on the entire block of soft bits, taking into account the constraints that the original bits of the block satisfied. These output soft bits can then be further processed, or converted by hard decision into output “hard” bits taking values 0 or 1. The input soft bits may be provided in the probability domain, for example, as the probability of a bit having a value of 1 or 0. Alternatively, the input soft bits may be provided in the log domain, for example, as the log likelihood ratio of a bit, e.g., as defined by

${\log ( \frac{p( {b_{i} =  0 \middle| y_{i} } )}{p( {b_{i} =  1 \middle| y_{i} } )} )}.$

In either case, the shared module 590 can be implemented using a set of analog circuit components that perform analog computation functions appropriate for the particular application. Implementations of some of these analog circuit components (such as soft Equals and soft XOR) are illustrated in detail in U.S. Patent Application Ser. No. 61/156,794, titled “Circuits for Soft Logical Functions,” filed Mar. 2, 2009.

Referring to FIG. 14, in some implementations, the soft Equals circuit makes use of differential voltage inputs, each representing a log likelihood ratio, to produce a voltage that is proportional to the sum of the inputs. Each differential voltage input is passed through a voltage to current converter 712, and the resulting currents are summed on a bus 714. The current on the bus is passed through a current to voltage converter 716. The output voltage then branches to the soft XOR circuits that require the output of this Equals node. Exemplary circuit implementations are shown in the figure. A variety of alternative circuits can be used, including alternative soft Equals circuits described in U.S. Patent Application Ser. No. 61/156,794.

Referring to FIG. 15A, in some implementations, the soft XOR circuits make use of log domain differential voltages as produced by the circuit shown in FIG. 14. In the exemplary implementation of the soft XOR circuit shown in FIG. 15A, which approximates an ideal soft XOR function for log domain processing, one differential voltage input is passed to a circuit 812. The second and further inputs to the soft XOR circuit are passed to circuits 814, each of which performs an analog computation that approximates multiplication of the current provided by the previous element according to that input. The resulting current approximates the ideal soft XOR function of the inputs and is passed through a current to voltage converter 816 to provide the differential voltage output of the soft XOR. Note that unlike the soft Equals circuit shown in FIG. 14, the output of the overall soft XOR circuit does not fan out on any particular cycle, because the output of the soft XOR circuit provides the input to only a single memory cell. The circuit parameters, for instance, resistance values, transistor dimensions, and voltage scaling, are chosen to best approximate the ideal function of a soft XOR and/or to optimize higher level (e.g., overall decoding) system performance.
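For reference, the ideal log domain functions that the circuits of FIGS. 14 and 15A approximate can be modeled numerically. The sketch below is not taken from this description: it uses the standard textbook forms, namely a sum of log likelihood ratios for soft Equals and 2·atanh of the product of tanh(L/2) terms for soft XOR, with the LLR convention log(p(b=0)/p(b=1)). The function names and the clamping constant are assumptions of the sketch.

```python
import math


def soft_equals(llrs):
    """Ideal soft Equals in the log domain: the output LLR is the sum of the inputs."""
    return sum(llrs)


def soft_xor(llrs):
    """Ideal soft XOR in the log domain (standard tanh rule).

    With LLR = log(p0 / p1), tanh(LLR / 2) equals p0 - p1, and the XOR of
    independent bits has (p0 - p1) equal to the product of the inputs' values.
    """
    product = 1.0
    for llr in llrs:
        product *= math.tanh(llr / 2.0)
    # Clamp to keep atanh finite for very confident inputs (sketch-only safeguard).
    limit = 1.0 - 1e-12
    product = max(-limit, min(limit, product))
    return 2.0 * math.atanh(product)


# Two inputs with Pr(b=0) of 0.9 and 0.8: the XOR is 0 with probability 0.74.
l1 = math.log(0.9 / 0.1)
l2 = math.log(0.8 / 0.2)
xor_llr = soft_xor([l1, l2])
```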

Referring to FIG. 15B, an alternative arrangement 820 of circuit elements to the soft XOR circuit 520 shown in FIG. 15A implements the directional soft XOR circuit using a branching tree structure, optionally sharing signals between different trees associated with a same bidirectional check node. Specifically, groups of circuit elements 818 effectively form two-input, one-output, voltage based soft XOR circuits using the circuit elements 812, 814, and 816 introduced with reference to FIG. 15A. These groups of circuit elements 818 are then arranged in a tree structure, preferably a binary tree structure that is as balanced as possible, to form the circuit arrangement 820 shown in the figure. In some implementations, the branching structure shown in FIG. 15B may have preferable characteristics, for instance, providing a better approximation of the ideal soft XOR function with LLR representations. Furthermore, when multiple modules 820 are implemented for a set of unidirectional XOR circuits, certain computations can be shared, for example, by passing a signal 825 from one module to another where a portion of the tree in that other module can be eliminated.
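The tree arrangement can be illustrated with an idealized numerical model in which a two-input soft XOR is applied recursively over a balanced split of the inputs. As with any software model of these circuits, this is a sketch under assumptions: the two-input function uses the standard tanh rule rather than the analog approximation of elements 818, and the function names are hypothetical.

```python
import math


def soft_xor2(a, b):
    """Ideal two-input soft XOR on LLRs (standard tanh rule, not the analog circuit)."""
    return 2.0 * math.atanh(math.tanh(a / 2.0) * math.tanh(b / 2.0))


def soft_xor_tree(llrs):
    """Combine inputs with two-input soft XORs arranged as a balanced binary tree."""
    if not llrs:
        raise ValueError("at least one input is required")
    if len(llrs) == 1:
        return llrs[0]
    mid = len(llrs) // 2
    # Split as evenly as possible so the tree depth stays near log2(len(llrs)).
    return soft_xor2(soft_xor_tree(llrs[:mid]), soft_xor_tree(llrs[mid:]))


# Nine inputs, as for a 9-input directional check node.
inputs = [1.2, -0.7, 0.3, 2.0, -1.5, 0.9, 0.4, -0.2, 1.1]
tree_output = soft_xor_tree(inputs)
```

In the ideal model the tree and the chained product give the same result; the text's point is that with non-ideal two-input elements the balanced tree may approximate the ideal function more closely.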

Referring to FIG. 16, in some implementations, the bus 714 associated with each equal gate is distributed. Each memory cell 230, which stores its value as a differential voltage, has at its output a corresponding voltage to current converter 912. These voltage to current converters are similar to the converters 712 shown in FIG. 14. The current output is passed to a set of switches 913, at most one of which is enabled if the corresponding cell's value is to be injected as a current on a current bus 714 corresponding to that switch 913. Each bus 714 similarly includes a portion onto which current associated with an appropriate input bit is injected at each cycle to account for the input to the equal gate corresponding to the bit input. Note that the bus 714 for each equal node may have a complex structure, for example, having numerous branches. Nevertheless, all the current injected onto the bus passes to the current to voltage converter 716 for the equal node, from where it branches to the unidirectional XOR circuits that require that output.

Referring to FIG. 17, in some implementations that make use of a distributed bus 714 to perform a current summation function, the module 590 illustrated in FIG. 12 is replaced by a module 592 in conjunction with distributed busses 714 and voltage-to-current converters 712. In module 592, each variable node corresponds to a current-to-voltage converter 716, which outputs a voltage proportional to the total injected current on the corresponding bus 714, and then that voltage branches to the appropriate check node circuits 520. Note that in yet other implementations, the current-to-voltage converters 716 are themselves distributed, and a module 593 (i.e., a portion of module 592) receives voltage inputs, which are internally distributed to the appropriate check node circuits.

Referring to FIG. 18, a second example of an implementation of a decoder operable to perform the iterative stage of decoding operation for use with the (1056, 352) LDPC code shown in FIG. 11 provides the same or similar functionality to the implementation shown in FIG. 13. In this example, modules 592 are used, as illustrated in FIG. 17. The circuitry that implements each soft Equals circuit associated with a variable node includes a current-to-voltage converter in the module 592, with the busses corresponding to the 10 unidirectional variable nodes of the module 592 being distributed. Each memory 652 includes circuitry to inject current to the appropriate busses corresponding to the soft Equals circuits for different variable nodes via read switching circuits 671. The bus section 644 effectively includes 81 busses, each associated with a different current-to-voltage converter 716 at the input of a module 592. Therefore, the soft Equals circuit is distributed in a manner that effectively forms interconnection paths between the memories 652 and the analog computation modules 592.

It should be understood that the decoder applications described above are only one example of an application of an analog belief propagation processor. The techniques employed in these examples are applicable to other uses of belief propagation.

It is to be understood that the foregoing description is intended to illustrate and not to limit the scope of the invention, which is defined by the scope of the appended claims. Other embodiments are within the scope of the following claims.

What is claimed is:
 1. A decoder comprising: a first memory for storing code data having a length in bits; a second memory for storing intermediate data in analog form; an analog decoder core coupled to the first memory and to the second memory, the decoder core having an input length less than the length of the code data and an output length less than a number of constraints represented in the code data; a controller for, in each of a plurality of cycles, coupling the inputs of the decoder core to selected values from the first and the second memories, and coupling outputs of the decoder core for storage in the second memory; and an output section coupled to the second memory for providing decoded data based on values stored in the second memory.
 2. The decoder of claim 1, wherein the first memory is configured for storing code data in analog form.
 3. A decoding method comprising: in each of a plurality of cycles of a decoding operation, applying a portion of code data and a portion of an intermediate value data to an analog decoder core, and storing an output of the decoder core in an analog storage for the intermediate data; and combining data, including intermediate value data from the analog storage, to form decoded data representing an error correction of the code data.
 4. The method of claim 3 wherein each of the plurality of cycles is associated with a corresponding subset of less than all of a plurality of parity-check constraints of the code.
 5. The method of claim 3 wherein the intermediate value data includes values each associated with a different one of the parity check constraints of the code.